27 research outputs found

    LSTMs Compose — and Learn — Bottom-Up


    Training dynamics of neural language models

    Why do artificial neural networks model language so well? We claim that in order to answer this question and understand the biases that lead to such high-performing language models---and all models that handle language---we must analyze the training process. For decades, linguists have used the tools of developmental linguistics to study human bias towards linguistic structure. Similarly, we wish to consider a neural network's training dynamics, i.e., the analysis of training in practice and the study of why our optimization methods work when applied. This framing shows us how structural patterns and linguistic properties are gradually built up, revealing more about why LSTM models learn so effectively on language data. To explore these questions, we might be tempted to appropriate methods from developmental linguistics, but we do not wish to make cognitive claims, so we avoid analogizing between human and artificial language learners. We instead use mathematical tools designed for investigating language model training dynamics. These tools can take advantage of crucial differences between child development and model training: we have access to activations, weights, and gradients in a learning model, and can manipulate learning behavior directly or by perturbing inputs. While most research in training dynamics has focused on vision tasks, language offers direct annotation of its well-documented and intuitive latent hierarchical structures (e.g., syntax and semantics) and is therefore an ideal domain for exploring the effect of training dynamics on the representation of such structure. Focusing on LSTM models, we investigate the natural sparsity of gradients and activations, finding that word representations are focused on just a few neurons late in training. Similarity analysis reveals how word embeddings learned for different tasks are highly similar at the beginning of training, but gradually become task-specific. Using synthetic data and measuring feature interactions, we also discover that hierarchical representations in LSTMs may be a result of their learning strategy: they tend to build new trees out of familiar phrases, by mingling together the meaning of constituents so they depend on each other. These discoveries constitute just a few possible explanations for how LSTMs learn generalized language representations, with further theories on more architectures to be uncovered by the growing field of NLP training dynamics.
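
    The sparsity finding above (word representations focusing on just a few neurons late in training) can be illustrated with a small sketch. This is not the dissertation's code; the checkpoint vectors and the 90% mass threshold below are illustrative assumptions.

        import numpy as np

        def neurons_for_mass(vec, mass=0.9):
            # Smallest number of neurons whose absolute values cover `mass`
            # of the vector's total L1 mass; lower means more concentrated.
            mags = np.sort(np.abs(vec))[::-1]
            total = mags.sum()
            if total == 0:
                return 0
            cum = np.cumsum(mags) / total
            return int(np.searchsorted(cum, mass) + 1)

        # Hypothetical early vs. late checkpoint vectors for the same word.
        rng = np.random.default_rng(0)
        early = rng.normal(size=512)           # diffuse representation
        late = np.zeros(512)
        late[:5] = 10 * rng.normal(size=5)     # mass on a handful of neurons
        print(neurons_for_mass(early), neurons_for_mass(late))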

    Understanding Learning Dynamics Of Language Models with SVCCA

    Research has shown that neural models implicitly encode linguistic features, but there has been no research showing how these encodings arise as the models are trained. We present the first study on the learning dynamics of neural language models, using a simple and flexible analysis method called Singular Vector Canonical Correlation Analysis (SVCCA), which enables us to compare learned representations across time and across models, without the need to evaluate directly on annotated data. We probe the evolution of syntactic, semantic, and topic representations and find that part-of-speech is learned earlier than topic; that recurrent layers become more similar to those of a tagger during training; and that embedding layers become less similar. Our results and methods could inform better learning algorithms for NLP models, possibly to incorporate linguistic information more effectively. Comment: Accepted for publication in NAACL 2019
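
    As a rough sketch of the SVCCA recipe described above (truncate each activation matrix with SVD, then run CCA on the reduced views and average the canonical correlations), the following uses scikit-learn's iterative CCA as a stand-in; it is not the authors' implementation, and the checkpoint activations below are synthetic.

        import numpy as np
        from sklearn.cross_decomposition import CCA

        def svcca_similarity(acts_a, acts_b, keep=0.99, n_components=20):
            # Mean canonical correlation between two (examples x neurons)
            # activation matrices, after SVD truncation of each.
            def reduce(acts):
                acts = acts - acts.mean(axis=0)
                u, s, _ = np.linalg.svd(acts, full_matrices=False)
                k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
                return u[:, :k] * s[:k]
            a, b = reduce(acts_a), reduce(acts_b)
            n = min(n_components, a.shape[1], b.shape[1])
            a_c, b_c = CCA(n_components=n, max_iter=2000).fit(a, b).transform(a, b)
            corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1] for i in range(n)]
            return float(np.mean(corrs))

        # Hypothetical usage: the same inputs run through an early and a late checkpoint.
        rng = np.random.default_rng(0)
        early = rng.normal(size=(1000, 256))
        late = early @ rng.normal(size=(256, 256)) + 0.1 * rng.normal(size=(1000, 256))
        print(svcca_similarity(early, late))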

    One Venue, Two Conferences: The Separation of Chinese and American Citation Networks

    At NeurIPS, American and Chinese institutions cite papers from each other's regions substantially less than they cite endogamously. We build a citation graph to quantify this divide, compare it to European connectivity, and discuss the causes and consequences of the separation. Comment: Workshop on Cultures of AI and AI for Culture @ NeurIPS 2022
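
    The headline measurement can be sketched in a few lines: given citation edges labelled by the citing and cited institutions' regions, compare how often each region cites within itself versus across regions. The edge list and region labels below are toy stand-ins, not the paper's NeurIPS data.

        from collections import Counter

        def endogamy_rates(edges):
            # edges: iterable of (citing_region, cited_region) pairs.
            # Returns the fraction of each region's outgoing citations
            # that stay within the same region.
            totals, within = Counter(), Counter()
            for src, dst in edges:
                totals[src] += 1
                if src == dst:
                    within[src] += 1
            return {region: within[region] / totals[region] for region in totals}

        # Hypothetical toy edge list standing in for the real citation graph.
        edges = [("US", "US"), ("US", "US"), ("US", "CN"),
                 ("CN", "CN"), ("CN", "CN"), ("CN", "EU"),
                 ("EU", "US"), ("EU", "CN"), ("EU", "EU")]
        print(endogamy_rates(edges))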

    Pareto Probing: Trading Off Accuracy for Complexity

    The question of how to probe contextual word representations for linguistic structure in a way that is both principled and useful has seen significant attention recently in the NLP literature. In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume. To measure complexity, we present a number of parametric and non-parametric metrics. Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations---e.g., why should the non-contextual fastText representations encode more morpho-syntactic information than the contextual BERT representations? These results suggest that common, simplistic probing tasks, such as part-of-speech labeling and dependency arc labeling, are inadequate to evaluate the linguistic structure encoded in contextual word representations. This leads us to propose full dependency parsing as a probing task. In support of our suggestion that harder probing tasks are necessary, our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations. Comment: Tiago Pimentel and Naomi Saphra contributed equally to this work. Camera-ready version of EMNLP 2020 publication. Code available at https://github.com/rycolab/pareto-probing
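
    To make the metric concrete, here is a minimal two-objective sketch: each probe is a point whose coordinates are both to be maximized (accuracy and an assumed "simplicity" score such as one minus normalized complexity; this framing and the numbers below are illustrative, not the paper's exact setup), and the hypervolume is the area the point set dominates above a reference point.

        def hypervolume_2d(points, ref=(0.0, 0.0)):
            # Area dominated by a set of (accuracy, simplicity) points,
            # both to be maximized, relative to the reference point.
            pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                         key=lambda p: p[0], reverse=True)
            area, prev_y = 0.0, ref[1]
            for x, y in pts:
                if y > prev_y:                  # point lies on the Pareto front
                    area += (x - ref[0]) * (y - prev_y)
                    prev_y = y
            return area

        # Hypothetical probes: (dev accuracy, 1 - normalized parameter count).
        probes = [(0.95, 0.10), (0.90, 0.40), (0.80, 0.75), (0.60, 0.95)]
        print(hypervolume_2d(probes))  # 0.765 for these toy points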

    Sparsity Emerges Naturally in Neural Language Models

    Concerns about interpretability, computational resources, and principled inductive priors have motivated efforts to engineer sparse neural models for NLP tasks. If sparsity is important for NLP, might well-trained neural models naturally become roughly sparse? Using the Taxi-Euclidean norm to measure sparsity, we find that frequent input words are associated with concentrated or sparse activations, while frequent target words are associated with dispersed activations but concentrated gradients. We find that gradients associated with function words are more concentrated than the gradients of content words, even controlling for word frequency. Comment: Published in the ICML 2019 Workshop on Identifying and Understanding Deep Learning Phenomena: https://openreview.net/forum?id=H1ets1h56
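
    A small sketch of a Taxi-Euclidean-style concentration measure, assuming the usual reading of the term as the ratio of a vector's L1 (taxicab) norm to its L2 (Euclidean) norm; lower values mean the mass sits on fewer units. This is not the paper's code, and the example vectors are made up.

        import numpy as np

        def taxi_euclidean(vec):
            # L1 norm divided by L2 norm: ranges from 1.0 (all mass on one
            # unit) to sqrt(len(vec)) (mass spread uniformly).
            vec = np.asarray(vec, dtype=float)
            l2 = np.linalg.norm(vec)
            return float(np.abs(vec).sum() / l2) if l2 > 0 else 0.0

        # Concentrated vs. dispersed toy gradient vectors.
        concentrated = np.array([5.0, 0.1, 0.0, 0.0, 0.1])
        dispersed = np.ones(5)
        print(taxi_euclidean(concentrated), taxi_euclidean(dispersed))  # ~1.04 vs ~2.24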

    Evaluating Informal-Domain Word Representations With UrbanDictionary

    Existing corpora for intrinsic evaluation are not targeted towards tasks in informal domains such as Twitter or news comment forums. We want to test whether a representation of informal words fulfills the promise of eliding explicit text normalization as a preprocessing step. One possible evaluation metric for such domains is the proximity of spelling variants. We propose how such a metric might be computed and how a spelling variant dataset can be collected using UrbanDictionary.
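
    One way the proposed metric might be computed, sketched with hypothetical inputs: check whether spelling-variant pairs (e.g., harvested from UrbanDictionary) sit closer together in embedding space than control pairs do. The embeddings, word pairs, and helper function below are made-up stand-ins.

        import numpy as np

        def cosine(u, v):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

        def variant_proximity_gap(embeddings, variant_pairs, control_pairs):
            # Mean cosine similarity of spelling-variant pairs minus that of
            # control pairs; a larger gap suggests the space groups informal
            # variants together without explicit text normalization.
            def mean_sim(pairs):
                sims = [cosine(embeddings[a], embeddings[b])
                        for a, b in pairs if a in embeddings and b in embeddings]
                return float(np.mean(sims))
            return mean_sim(variant_pairs) - mean_sim(control_pairs)

        # Toy embeddings in which variants land near their standard spellings.
        rng = np.random.default_rng(0)
        emb = {w: rng.normal(size=50) for w in ["tomorrow", "tmrw", "because", "bc", "cat", "tree"]}
        emb["tmrw"] = emb["tomorrow"] + 0.1 * rng.normal(size=50)
        emb["bc"] = emb["because"] + 0.1 * rng.normal(size=50)
        print(variant_proximity_gap(emb, [("tomorrow", "tmrw"), ("because", "bc")],
                                    [("cat", "tree"), ("tomorrow", "cat")]))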